Efficient Image Captioning Based on Vision Transformer Models
Authors
Abstract
Image captioning is an emerging field in machine learning. It refers to the ability to automatically generate a syntactically and semantically meaningful sentence that describes the content of an image. It requires a complex learning process, as it involves two sub-models: a vision sub-model for extracting object features, and a language sub-model that uses the extracted features to generate captions. Attention-based transformer models have had a great impact recently. In this paper, we study the effect of using transformers on image captioning by evaluating four different transformer sub-models. The first uses DINO (self-distillation with no labels). The second uses PVT (Pyramid Vision Transformer), which contains no convolutional layers. The third uses XCIT (Cross-Covariance Image Transformer), which changes the operation of self-attention to focus on the feature dimension instead of the token dimension. The last one uses SWIN (Shifted Windows), which, unlike the other transformers, uses shifted-window splitting. For a deeper evaluation, the mentioned models were tested in different versions of their configurations; in total, the model was evaluated with five backbone versions: DINO, PVT_v1, PVT_v2, XCIT, and the SWIN transformer. The results show the high effectiveness of the SWIN transformer within the proposed model with regard to the other backbones.
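The XCIT idea mentioned above can be made concrete with a minimal numpy sketch: standard self-attention builds an N x N map over tokens, while cross-covariance attention builds a d x d map over feature channels, so its cost grows linearly with the number of tokens. This is an illustrative single-head sketch only (no learned Q/K/V projections or temperature, which the real XCIT uses), not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_self_attention(X):
    # Standard self-attention: N x N attention map over tokens.
    N, d = X.shape
    A = softmax(X @ X.T / np.sqrt(d))  # (N, N)
    return A @ X                       # (N, d)

def cross_covariance_attention(X):
    # XCIT-style: d x d attention map over feature channels.
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-6)  # normalize channels
    A = softmax(Xn.T @ Xn)             # (d, d) cross-covariance map
    return X @ A                       # (N, d)

X = np.random.randn(196, 64)           # e.g. 196 patch tokens, 64-dim features
print(token_self_attention(X).shape)       # (196, 64)
print(cross_covariance_attention(X).shape) # (196, 64)
```

Both variants return features of the same shape; the difference is that the intermediate attention map is (196, 196) in the token case but only (64, 64) in the cross-covariance case.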
Similar resources
Phrase-based Image Captioning
Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representat...
Can Saliency Information Benefit Image Captioning Models?
To bridge the gap between humans and machines in image understanding and describing, we need further insight into how people describe a perceived scene. In this paper, we study the agreement between bottom-up saliency-based visual attention and object referrals in scene description constructs. We investigate the properties of human-written descriptions and machine-generated ones. We then propos...
Language Models for Image Captioning: The Quirks and What Works
Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a...
Domain-Specific Image Captioning
We present a data-driven framework for image caption generation which incorporates visual and textual features with varying degrees of spatial structure. We propose the task of domain-specific image captioning, where many relevant visual details cannot be captured by off-the-shelf general-domain entity detectors. We extract previously-written descriptions from a database and adapt them to new q...
Convolutional Image Captioning
Image captioning is an important but challenging task, applicable to virtual assistants, editing tools, image indexing, and support of the disabled. Its challenges are due to the variability and ambiguity of possible image descriptions. In recent years significant progress has been made in image captioning, using Recurrent Neural Networks powered by long-short-term-memory (LSTM) units. Despite ...
Journal
Journal title: Computers, Materials & Continua
Year: 2022
ISSN: 1546-2218, 1546-2226
DOI: https://doi.org/10.32604/cmc.2022.029313